12/5/1 (Item 1 from file: 8) 

DIALOG (R) File 8 : Ei Compendex(R) 

(c) 2004 Elsevier Eng. Info. Inc. All rts. reserv. 

06183619 E.I. No: EIP024 4 7 17 67 13 

Title: Data cache design considerations for the Itanium registered
trademark processor 

Author: Lyon, Terry; Delano, Eric; McNairy, Cameron; Mulla, Dean 
Corporate Source: Hewlett-Packard, Fort Collins, CO, United States 
Conference Title: International Conference on Computer Design (ICCD'02)
VLSI in Computers and Processors

Conference Location: Freiburg, Germany Conference Date: 20020916-20020918

Sponsor: IEEE Computer Society 
E.I. Conference No.: 60016 

Source: Proceedings - IEEE International Conference on Computer Design: 
VLSI in Computers and Processors 2002. p 356-362 
Publication Year: 2002 
CODEN: PIIPE6 
Language: English 

Document Type: CA; (Conference Article) Treatment: G; (General Review) 

Journal Announcement: 0211W1 

Abstract: The second member in the Itanium Processor Family, the Itanium
2 processor, was designed to meet the challenge for high performance in
today's technical and commercial server applications. The Itanium 2
processor's data cache microarchitecture provides abundant memory
resources, low memory latencies and cache organizations tuned for a
variety of applications. The data cache design provides four memory
ports to support the many performance optimizations available in the EPIC
(Explicitly Parallel Instruction Computing) design concepts, such as
predication, speculation and explicit prefetching. The three-level cache
hierarchy provides a 16KB 1-cycle first level cache to support the
moderate bandwidths needed by integer applications. The second level
cache is 256KB with a relatively low latency and FP balanced bandwidth to
support technical applications. The on-chip third level cache is 3MB and
is designed to provide the low latency and the large size needed by
commercial and technical applications. 9 Refs.

Abstract: We present an effective caching scheme that reduces the
computing and I/O requirements of a Web search engine without altering
its ranking characteristics. The novelty is a two-level caching
scheme that simultaneously combines cached query results and cached
inverted lists on a real case search engine. A set of log queries is
used to measure and compare the performance and the scalability of the
search engine with no cache, with the cache for query results, with
the cache for inverted lists, and with the two-level cache.
Experimental results show that the two-level cache is superior, and that
it allows increasing the maximum number of queries processed per second by
a factor of three, while preserving the response time. These results are
new, have not been reported before, and demonstrate the importance of
advanced caching schemes for real case search engines. 32 Refs.
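
Illustrative note: the two-level scheme described in this abstract can be
sketched as follows. This is a minimal sketch, not the authors'
implementation. The class names, the LRU replacement policy, the in-memory
`index` dictionary standing in for the on-disk index, and the use of a
plain AND over terms in place of real ranking are all assumptions made for
the example.

```python
from collections import OrderedDict


class LRUCache:
    """Small least-recently-used cache built on OrderedDict."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.items[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used


class TwoLevelSearchCache:
    """Level 1 caches whole query results; level 2 caches inverted lists."""

    def __init__(self, index, result_capacity, list_capacity):
        self.index = index  # hypothetical index: term -> sorted doc ids
        self.results = LRUCache(result_capacity)
        self.lists = LRUCache(list_capacity)
        self.disk_reads = 0

    def _inverted_list(self, term):
        postings = self.lists.get(term)
        if postings is None:
            self.disk_reads += 1  # stands in for the I/O cost of a list fetch
            postings = self.index.get(term, [])
            self.lists.put(term, postings)
        return postings

    def search(self, query):
        # Normalize so that "web cache" and "cache web" share one entry.
        terms = tuple(sorted(set(query.lower().split())))
        cached = self.results.get(terms)
        if cached is not None:
            return cached
        docs = None
        for term in terms:
            postings = set(self._inverted_list(term))
            docs = postings if docs is None else docs & postings
        answer = sorted(docs or [])
        self.results.put(terms, answer)
        return answer
```

A repeated query is answered from the result cache without touching any
inverted list, while a new query that shares terms with earlier ones still
avoids some list fetches. Because both caches only store what the engine
would have computed anyway, the returned documents are unchanged.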

Abstract: We present an architecture that features dynamic multithreading
execution of a single program. Threads are created automatically by 
hardware at procedure and loop boundaries and executed speculatively on a 
simultaneous multithreading pipeline. Data prediction is used to alleviate 
dependency constraints and enable look ahead execution of the threads. A 
two-level hierarchy significantly enlarges the instruction window.
Efficient selective recovery from the second level instruction window takes 
place after a mispredicted input to a thread is corrected. The second level 
is slower to access but has the advantage of large storage capacity. We 
show several advantages of this architecture: (1) it minimizes the impact 
of ICache misses and branch mispredictions by fetching and dispatching 
instructions out-of-order, (2) it uses a novel value prediction and 
recovery mechanism to reduce artificial data dependencies created by the 
use of a stack to manage run-time storage, and (3) it improves the 
execution throughput of a superscalar by 15% without increasing the
execution resources or cache bandwidth, and by 30% with one additional
ICache fetch port. The speedup was measured on the integer SPEC95 
benchmarks, without any compiler support, using a detailed performance 
simulator. (Author abstract) 15 Refs. 

Abstract: A solution is formulated to the problem of selecting the cache
size and depth of cache pipelining that maximize the performance of a given
instruction-set architecture. The solution combines
trace-driven architectural simulations and the timing analysis of the
physical implementation of the cache. Increasing cache size tends to
improve performance, but this improvement is limited because cache access
time increases with its size. This tradeoff results in an optimization
problem called multilevel optimization, because it requires the
simultaneous consideration of two levels of machine abstraction: the
architectural level and the physical implementation level. 22 Refs.
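
Illustrative note: the tradeoff in this abstract can be made concrete with
a small search over design points. This is a sketch under stated
assumptions, not the paper's model. Every number below is hypothetical: the
miss ratios stand in for trace-driven simulation output, the access times
stand in for timing analysis, and the cost formula (one stall cycle per
extra cache pipeline stage for dependent loads) is a simplification chosen
for the example.

```python
import math

# Hypothetical inputs, keyed by cache size in KB.
MISS_RATIO = {8: 0.080, 16: 0.055, 32: 0.038, 64: 0.027, 128: 0.020}
ACCESS_NS = {8: 2.0, 16: 2.6, 32: 3.4, 64: 4.5, 128: 6.0}

BASE_CPI = 1.2            # CPI with a perfect single-cycle cache (assumed)
LOADS_PER_INSTR = 0.25    # fraction of instructions that are loads (assumed)
LOAD_USE_STALL = 0.5      # fraction of loads needed immediately (assumed)
MISS_PENALTY_NS = 60.0    # main-memory latency in nanoseconds (assumed)
LOGIC_NS = 2.0            # slowest non-cache pipeline stage (assumed)
LATCH_NS = 0.2            # latch overhead added to each stage (assumed)


def time_per_instruction(size_kb, depth):
    """Return nanoseconds per instruction for one design point."""
    # Physical level: pipelining the cache over `depth` stages shortens
    # the cycle until the rest of the logic becomes the limit.
    cycle = max(LOGIC_NS, ACCESS_NS[size_kb] / depth) + LATCH_NS
    # Architectural level: deeper cache pipelines stall dependent loads.
    pipeline_cpi = LOADS_PER_INSTR * LOAD_USE_STALL * (depth - 1)
    # The miss penalty is fixed in ns, so its cycle count follows the clock.
    miss_cycles = math.ceil(MISS_PENALTY_NS / cycle)
    miss_cpi = LOADS_PER_INSTR * MISS_RATIO[size_kb] * miss_cycles
    return cycle * (BASE_CPI + pipeline_cpi + miss_cpi)


def best_design(sizes=tuple(MISS_RATIO), depths=(1, 2, 3)):
    """Exhaustively search size and pipeline depth for the lowest time."""
    return min(((s, d) for s in sizes for d in depths),
               key=lambda point: time_per_instruction(*point))
```

Calling `best_design()` returns the `(size_kb, depth)` pair with the lowest
time per instruction under these made-up inputs. The point of the sketch is
only that neither the miss ratio nor the cycle time can be optimized alone.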

Abstract: Multiple processors can work together to speed up single
applications, but sequential programs must be rewritten to take advantage 
of the extra processors. One way to do this is through automatic 
parallelization with a compiler. Multiprocessors pose especially 
challenging problems for parallelizing compilers. Sufficient work must be 
performed in parallel to overcome processor synchronization and 
communication overhead. Moreover, multiprocessor memory hierarchies are 
complex, containing both shared memory and multiple levels of cache 
memory. Thus, two techniques are essential in obtaining good 
multiprocessor performance for array-based numerical programs: locating 
coarse-grain parallelism and managing multiprocessor memory use. The 
authors describe new technology in the Stanford SUIF compiler that enables 
it to successfully carry out these techniques. First, a suite of robust
analysis techniques operates across procedure boundaries to locate
coarse-grain parallelism so that large computations can execute
independently in parallel. Then, to help eliminate cache misses, affine 
partitioning is used to improve processor reuse of data, and data 
permutation and data strip-mining make contiguous the data accessed by each 
processor in the shared address space. When employed in the automatic 
parallelizing compiler, these techniques significantly affect the 
performance of half the NAS and SPECfp95 benchmark suites. (Author 
abstract) 12 Refs.

Abstract: A general-purpose image processor (GPIP) consisting of 64
digital signal processors (DSPs) in a 0.31-m**3 box is proposed to perform
a wide range of image processing tasks. A high-speed DSP called DSP-i has
been especially developed for this purpose. It has a highly parallel
architecture with a two-level instruction hierarchy, multibank cache,
and multiprocessor interface. The DSP-i machine cycle is 50 ns. A novel
ring shift register bus architecture offers a flexible structure and an
efficient data-exchange method for the system. Along with four proposed
operation modes, it cuts the multiprocessing overhead to as little as 20%.
The performance of the GPIP is 1000 MOPS (million operations per second). 5
Refs.

A great majority of optimization works on steel structures deal with
minimum weight design. But in reality minimum weight design is not 
necessarily a minimum cost design. The price of commercially available 
rolled steel shapes in the market does not necessarily depend only on its 
weight but also on some other factors including demands and grades. In cost 
optimization, however, additional difficulties are encountered. They 
include the definition of cost function and uncertainties and fuzziness 
involved in determining the cost parameters. As a result only a small 
fraction of the structural optimization papers published deal with the 
minimization of the cost. 

In this work a fuzzy discrete multicriteria initial cost optimization
model has been developed by considering three criteria: (1) minimum
cost, (2) minimum weight, and (3) minimum number of section
types. This discrete cost optimization procedure is preceded by a fuzzy
genetic continuous variable minimum weight design as a preliminary design. 
In this design the uncertainty or fuzziness of the AISC code based design 
constraints are considered. Furthermore, the fuzzy multicriteria initial 
cost optimization model is extended to perform a life cycle cost 
optimization of steel structures. 

For optimization of large steel structures more than 99% of 
computation time is spent on the minimum weight design using the fuzzy 
Genetic Algorithm. Sequential processing of this optimization work in a 
single processor is very inefficient resulting in huge numbers of page 
faults and cache misses even in a supercomputer like Origin 2000. In this
work two bi-level parallel processing algorithms are developed by
using (1) a data parallel procedure using the OpenMP API at the inner level
and (2) a message passing distributed parallel processing procedure
using function calls from the MPI message passing library at the outer level.
Significant performance enhancement has been obtained in comparison to the
traditional data parallel or distributed message passing parallel
processing.

IEEE Circuits & Syst. Soc

Conference Date: 16-18 Sept. 2002 Conference Location: Freiburg,
Germany

Abstract: The second member in the Itanium Processor Family, the Itanium
2 processor, was designed to meet the challenge for high performance in
today's technical and commercial server applications. The Itanium 2
processor's data cache microarchitecture provides abundant memory
resources, low memory latencies and cache organizations tuned for a
variety of applications. The data cache design provides four memory ports
to support the many performance optimizations available in the EPIC
(Explicitly Parallel Instruction Computing) design concepts, such as
predication, speculation and explicit prefetching. The three-level cache
hierarchy provides a 16KB 1-cycle first level cache to support the
moderate bandwidths needed by integer applications. The second level
cache is 256KB with a relatively low latency and FP balanced bandwidth to
support technical applications. The on-chip third level cache is 3MB and
is designed to provide the low latency and the large size needed by
commercial and technical applications. (7 Refs)

Conference Title: High Performance Computing. 4th International
Symposium, ISHPC 2002. Proceedings 

Conference Date: 15-17 May 2002 Conference Location: Kansai Science 
City, Japan 

Language: English Document Type: Conference Paper (PA) 

Treatment: Applications (A); Practical (P) 

Abstract: Clustering is a technique for partitioning a superscalar
processor's execution resources to simultaneously allow for more in-flight
instructions, wider issue width, and more aggressive clock speeds. As
either the size of individual clusters or the total number of clusters
increases, the distance to the first level data cache increases as well.
Although clustering may expose more parallelism by allowing a greater
number of instructions to be simultaneously analyzed and issued, the
gains may be obliterated if the latencies to memory grow too large. We
propose to augment each cluster with a small, fast, simple Level Zero
(L0) data cache that is accessed in parallel with a traditional L1
data cache. The difference between our solution and other proposed
caching techniques for clustered processors is that we do not support
versioning or coherence. This may occasionally result in a load instruction
that reads a stale value from the L0 cache, but the common case is a
low latency hit in the L0 cache. Our simulation studies show that 4 KB,
2-way set associative L0 caches provide a 6.5-12.3% IPC improvement over
a wide range of processor configurations. (19 Refs)
Subfile: C 
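
Illustrative note: the stale-load behavior mentioned in this abstract can
be shown with a toy model. This is a sketch under assumptions, not the
simulated design from the paper. The class names are hypothetical, the L0
is direct-mapped rather than 2-way set associative to keep the code short,
a plain dictionary stands in for the shared L1 and memory, and the stale
counter exists only to make the effect visible.

```python
class L0Cache:
    """Tiny direct-mapped cache private to one cluster (a simplification)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.tags = [None] * num_lines
        self.data = [None] * num_lines

    def lookup(self, addr):
        line = addr % self.num_lines
        if self.tags[line] == addr:
            return True, self.data[line]
        return False, None

    def fill(self, addr, value):
        line = addr % self.num_lines
        self.tags[line] = addr
        self.data[line] = value


class ClusteredCore:
    """Clusters share one L1; each has its own L0 with no coherence."""

    def __init__(self, num_clusters, l0_lines):
        self.l1 = {}  # stands in for the shared L1 data cache and memory
        self.l0 = [L0Cache(l0_lines) for _ in range(num_clusters)]
        self.l0_hits = 0
        self.stale_loads = 0

    def store(self, cluster, addr, value):
        self.l1[addr] = value
        # Only the storing cluster's L0 is updated; no other L0 is notified.
        self.l0[cluster].fill(addr, value)

    def load(self, cluster, addr):
        hit, value = self.l0[cluster].lookup(addr)
        current = self.l1.get(addr, 0)
        if hit:
            self.l0_hits += 1
            if value != current:
                self.stale_loads += 1  # counted for illustration only
            return value
        self.l0[cluster].fill(addr, current)
        return current
```

If cluster 1 loads an address, cluster 0 then stores to it, and cluster 1
loads it again, the second load hits in cluster 1's L0 and returns the old
value. That is the occasional stale read the abstract accepts in exchange
for a simple, fast L0.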

Abstract: Modern machines present two challenges to algorithm engineers
and compiler writers: they have superscalar, super-pipelined structure, and 
they have elaborate memory subsystems specifically designed to reduce 
latency and increase bandwidth. Matrix multiplication is a classical 
benchmark for experimenting with techniques used to exploit machine 
architecture and to overcome the limitations of contemporary memory 
subsystems. This research aims at advancing the state of the art of 
algorithm engineering by balancing instruction level parallelism, two 
levels of data tiling, copying to provably avoid any cache conflicts, 
and prefetching in parallel to computational operations, in order to 
fully exploit the memory bandwidth. Measurements on IBM's RS/6000 43P
workstation show that the resultant matrix multiplication algorithm
outperforms IBM's ESSL by 6.8-31.8%, is less sensitive to the size of the
input data, and scales better. We introduce a cache aware algorithm for
matrix multiplication. We also suggest generic guidelines that may be
applied to compute intensive algorithms to efficiently utilize the data
cache. We believe that some of our concepts may be embodied in compilers.
(17 Refs) 

Abstract: Modern machines present two challenges to algorithm engineers
and compiler writers: they have superscalar, super-pipelined structure, and 
they have elaborate memory subsystems specifically designed to reduce 
latency and increase bandwidth. Matrix multiplication is a classical 
benchmark for experimenting with techniques used to exploit machine 
architecture and to overcome the limitations of contemporary memory 
subsystems. This research aims at advancing the state of the art of 
algorithm engineering by balancing instruction level parallelism, two 
levels of data tiling, copying to provably avoid any cache conflicts, 
and prefetching in parallel to algorithmic operations, in order to 
fully exploit the memory bandwidth. Measurements show that the resultant 
matrix multiplication algorithm outperforms IBM's ESSL by 6.8-31.8%, is 
less sensitive to the size of the input data, and scales better. The 
techniques presented in the paper have been developed specifically for 
matrix multiplication. However, they are quite general and may be applied 
to other numeric algorithms. We believe that some of our concepts may be 
generalized to be used as compile-time techniques. (17 Refs) 
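
Illustrative note: data tiling combined with copying, two of the
techniques named in this abstract, can be sketched as below. This is a
minimal sketch and not the authors' algorithm. It uses a single level of
tiling rather than two, it omits prefetching and instruction scheduling,
and the function name and `tile` parameter are assumptions. In pure Python
the copy does not control memory layout; in a compiled language the copied
tile would be a contiguous buffer.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (n x m) by b (m x p), processing b one tile at a time."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for kk in range(0, m, tile):
        k_end = min(kk + tile, m)
        for jj in range(0, p, tile):
            j_end = min(jj + tile, p)
            # Copy the current tile of b into its own small buffer. The
            # copy is reused for every row of a before moving on.
            b_tile = [row[jj:j_end] for row in b[kk:k_end]]
            for i in range(n):
                a_row, c_row = a[i], c[i]
                for k in range(kk, k_end):
                    a_ik = a_row[k]
                    b_row = b_tile[k - kk]
                    for j in range(jj, j_end):
                        c_row[j] += a_ik * b_row[j - jj]
    return c
```

The loop order keeps one tile of `b` live while all rows of `a` stream past
it, which is the reuse pattern that tiling is meant to create. The tile
size would normally be chosen so that the copied tile fits in the data
cache.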

Abstract: By gathering multiple processors in one LSI chip, communication
delay between processors becomes shorter and then efficient fine/medium 
grain parallel processing can be realized. The authors propose a new 
processor architecture called OCMP (On-Chip Multi-Processing
Architecture). OCMP has two characteristics: one is the instruction
level dispatching mechanism; and the other is the divided cache system.
OCMP employs a fork-join type parallel processing model in order to
simplify the dispatching mechanism. By dividing the cache system into
shared cache and private cache, the cache coherence problem between
processors on the same chip is removed and access conflict on the shared
cache is also relaxed. OCMP is evaluated with the instruction level
simulator developed by the authors. Two types of instruction level 
dispatching mechanisms are compared. The memory access mechanism is 
evaluated with various parameters such as memory access cost, the degree of 
simultaneous access to shared cache, and so on. (9 Refs)

ABSTRACT:

In this paper, the authors discuss a new approach to multiversion
concurrency control that allows high-performance transaction systems to
support the on-line execution of long-running queries (e.g., for decision
support purposes). Long-running queries are typically run at level 1
or level 2 consistency because they introduce a high level of data
contention with two-phase locking. Multiversion algorithms have been
discussed as a way to reduce the level of data contention and at the same
time support the serializable execution of queries. The approach extends
the multiversion locking algorithm developed by Computer Corporation of
America by using record-level versioning and reserving a portion of each
data page for caching prior versions that are potentially needed for the
serializable execution of queries; on-page caching also enables an
efficient approach to garbage collection of old versions. In addition, the
authors introduce view sharing, which has the potential for reducing the
cost of versioning by grouping together queries to run against the same
transaction-consistent view of the database. Finally, they also present
results from a simulation study that compares their approach versus that of
providing level 1 or level 2 consistency for queries. The results indicate
that the authors' approach is a viable alternative to reduced-consistency
locking when the portion of each data page reserved for prior versions is
chosen appropriately.
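
Illustrative note: record-level versioning with a reserved on-page area
for prior versions can be sketched as follows. This is a toy model under
assumptions, not the algorithm from the paper. The `Page` class, the
integer commit timestamps, the oldest-first eviction rule, and the garbage
collection test are all hypothetical choices made for the example, and
locking is left out entirely.

```python
class Page:
    """A data page with a bounded area reserved for prior record versions."""

    def __init__(self, version_slots):
        self.current = {}   # key -> (commit_ts, value)
        self.prior = []     # (key, commit_ts, value), oldest first
        self.version_slots = version_slots

    def write(self, key, value, commit_ts):
        if key in self.current:
            old_ts, old_value = self.current[key]
            self.prior.append((key, old_ts, old_value))
            if len(self.prior) > self.version_slots:
                self.prior.pop(0)  # reserved area is full: drop the oldest
        self.current[key] = (commit_ts, value)

    def read(self, key, snapshot_ts):
        """Return the value a query with this snapshot should see."""
        ts, value = self.current[key]
        if ts <= snapshot_ts:
            return value
        for k, old_ts, old_value in reversed(self.prior):  # newest first
            if k == key and old_ts <= snapshot_ts:
                return old_value
        raise LookupError("no version visible at this snapshot on the page")

    def collect_garbage(self, oldest_active_ts):
        """Drop prior versions that no active query can still see."""
        kept = []
        newer_ts = {k: ts for k, (ts, _) in self.current.items()}
        for key, ts, value in reversed(self.prior):  # newest first
            # A version is still needed only while its successor is too
            # new for the oldest active query.
            if newer_ts[key] > oldest_active_ts:
                kept.append((key, ts, value))
            newer_ts[key] = ts
        self.prior = list(reversed(kept))
```

Queries that are given the same `snapshot_ts` read the same
transaction-consistent state, which is the idea behind view sharing. The
`version_slots` parameter plays the role of the reserved portion of the
page: if it is too small, a long-running query can fail to find the
version it needs.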



